From Messy Data to Research Insights

Data Analytics for Finance

Caspar David Peter

Rotterdam School of Management, Accounting Department

Introduction

Introduction

Who am I?


(c)RSM
Caspar David Peter
  • 🏢     Rotterdam School of Management
  • 👷     Associate Professor
  • 🏫     Doctorate (WHU)
  • 🏫     Dipl.-Ök. (RUB)
  • 👶     Bochum, Germany
  • 🏠     Rotterdam, Netherlands
  • 🍞🧈 Economic consequences of transparency

Setting the Stage

Setting the Stage

Why are we here?

The struggle is real

What we gonna do about it?

  • L1-2: Data wrangling & visualization (the foundation)
  • L3-4: Regression & panel methods (the workhorse)
  • L5: Event studies & Fama-MacBeth (specialized tools)
  • L6: Replication workshop (putting it together)

Setting the Stage

Logistics

Lectures

  • Dates & Times & Rooms: Timetable
  • Fomat: 6 × ~ 3-hour sessions
  • No mandatory attendance, but highly recommended!

Resources

  • Course Canvas page
  • Course website
  • Textbooks: Huntington-Klein (2022) and Verbeek (2021)
  • Required readings (see lecture slides)

Setting the Stage

Logistics

Assignments

  • 6 assignments (weekly, after each lecture)
  • Assignment 6 is a “replication” project
  • Access via browser (JupyterHub)
  • No local installation needed!
  • Deadlines/Details in assignment documents
  • We do a walkthrough later today

How to succeed

  • Complete assignments on time
  • Grading
    • All assignments must be completed
    • Assignmnets are equally weighted
    • Fail if overall assignmnet score < 5.5

Setting the Stage

Philosophy of the course

This is NOT a coding course!

What this course is NOT

  • A Stata coding tutorial
  • A software certification
  • About memorizing syntax

What this course IS

  • A course about research design and empirical methods
  • Learning how to think about data and causality EVERYWHERE
  • Understanding what steps you need to take and why
  • A fallback when you get stuck in your thesis work

Setting the Stage

Elephant in the room: Stata

The role of coding

  • Implementation is how we deepen understanding
  • Hands-on practice makes abstract concepts concrete
  • But if you understand the concepts, you can implement in any language (Python, R, Stata, Julia…)

Why Stata?

  • Research lingua franca in finance/accounting—most papers use it
  • Your thesis supervisor likely uses it
  • Replication packages are overwhelmingly in Stata

Setting the Stage

Takeaway

If you understand what you’re trying to do and why, the coding is the easy part. If you don’t understand the research design, no amount of syntax knowledge will save you.

On (NOT) using LLMs

  • The course manual is clear on this topic
  • “…if you don’t understand what you’re asking for, you won’t know if the answer is right.”
  • Can I check if you used LLMs? No.
  • But latest during your thesis writing and defense, it will show if you don’t understand what you did.

Setting the Stage

Core Skills 2025 - World Economic Forum

Future of Jobs Report2025

Organizing Research

Organizing Research

Your future self will thank you!

Organizing Research

Your future self will thank you!

Project structure

Key principles (adjust to your needs)

  • Raw data is sacred—never modify it directly
  • Running order, e.g. number your scripts in execution order
  • One script = one purpose
  • Document every decision that involves judgment in the code with comments

Organizing Research

Reproducibility & Quarto

Thesis requirements

  • Goal: Replicability of the scientific results
  • All steps from raw data to final results must be documented
  • MSc thesis ideally one source file (e.g., “.qmd” file)
    • Includes code, output, and narrative (references, etc.)
  • More details in the thesis manual!

Assignments

  • Assignments are on JupyterHub
    • “.ipynb” files (Jupyter Notebooks)
    • Run code cells interactively in the browser
  • Follow same principles as for thesis work
    • They combine code, output, and narrative
    • Run from top to bottom to reproduce results
  • Assignments mimick thesis workflow as closely as possible

Organizing Research

Stata, Python, and Quarto

Data Structure & Tidy Data

Data Structure & Tidy Data

What is tidy data?

Definition

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

Why is tidy data important?

  • Almost every statistical command expects tidy data
  • (Data) Cleaning means getting from raw to tidy data
  • Easier to understand, document, and share

Data Structure & Tidy Data

Tidy versus “messy” data

“Messy” data (wide format)

Student ID Score_22 Score_23 Score_2024
S001abc 85 90.0 92.0
S002_ 78.0 81 85
S 003 92 95 98.0

Tidy data (long format)

Student ID Year Score
S001 2022 85
S001 2023 90
S001 2024 92
S002 2022 78
S002 2023 81
S002 2024 85
S003 2022 92
S003 2023 95
S003 2024 98

Data Structure & Tidy Data

(Typical) Finance data structures

  • Cross-sectional: One observation per firm (snapshot)
  • Time series: Multiple observations over time for one firm
  • Panel data: Multiple firms × multiple time periods → firm-year observations

Most of your thesis work will use panel data. We’ll return to this in Lecture 4

Data Structure & Tidy Data

What else is there?

Other data structures

  • Hierarchical / nested data
    • XBRL data
    • JSON data
    • Lists in R/Python
    • Dictionaries in Python
  • Spatial data
    • Shapefiles, GeoJSON
    • GIS software (ArcGIS, QGIS)

Database types

  • Relational databases
    • SQL databases, e.g. PostgreSQL (WRDS), MySQL (backend of many web apps)
    • NoSQL databases, e.g. MongoDB (document storage)
  • Time-series databases
    • InfluxDB, TimescaleDB

These are beyond the scope of this course!

Data Structure & Tidy Data

Each one teach one: Where to get finance data?

Key data sources

WRDS (Wharton Research Data Services)
  • Prices: CRSP (US), Compustat Global, and Datastream/LSEG (global)
  • Intraday prices: TAQ
  • Listed firms’ fundamentals: CRSP, Compustat, Bureau van Dijk
  • Private firms: Bureau van Dijk (Orbis, Amadeus)
  • Institutional ownership: Thomson Reuters Institutional (13F) Holdings
  • Board and Executives: BoardEx, ExecuComp, Incentive Lab
  • Analyst forecasts: I/B/E/S
  • Auditor data: Audit Analytics
  • 10-K/SEC filings: EDGAR
  • ESG data: MSCI ESG, Sustainalytics
Library & other sources
  • Mutual funds, ETFs, etc.: Morningstar Direct: Mutual funds, ETFs, etc.
  • Mergers & Acquisitions: SDC Platinum
  • Text data:
    • REddit (via API)
    • SEC filings (via EDGAR or WRDS)
    • News data (via Bloomberg, Refinitiv, or other providers)
  • Macroeconomic data: FRED, World Bank, IMF, OECD

Data Structure & Tidy Data

Each on teach one: How to access Data?

Access via Library

Accessing WRDS

WRDS via web interface
  • Point-and-click data extraction
  • Export to CSV, Stata, SAS, R, Python
  • Good for one-off downloads
WRDS via API
  • Programmatic access via Python, R, Stata (and SAS)
  • Good for reproducible research workflows
  • See WRDS documentation for details

Merging & Identifiers

Merging & Identifiers

Why do we care?

  • Real-world data is messy and fragmented
  • Different databases use different identifiers
  • Merging datasets is essential for comprehensive analysis
  • Understanding identifiers helps avoid merge errors and data inconsistencies

Most important step to get from raw data to analysis-ready data!

Merging & Identifiers

The Identifier Zoo

Common Identifiers Across Major Finance Databases
Source CRSP Compustat Thomson Refinitiv
CRSP Linking within the CRSP Environment CRSP/Compustat Merged (CCM) MFLINKs for Mutual Fund Linking; Linking through CUSIP; 13F Linking Program
Compustat CRSP/Compustat Merged (CCM); Using Compustat Historical Identifiers Linking within the Compustat Environment using GVKEY Link through CUSIP
Capital IQ CIQ company ID linked to GVKEY and then to CRSP Linking within Capital IQ Link through CUSIP, ISIN or Ticker
IBES IBES CRSP Link Link through CRSP Link through Security Mapping
Audit Analytics Link through Compustat Link through CIK Link (primary); Link through Ticker (secondary) Link these two products using Compustat
BoardEx Link with CRSP and Compustat Link with CRSP and Compustat TR Insiders Boardex Link
Bond: TRACE / Mergent FISD Bond CRSP Link Link through CRSP Link through CRSP
Bureau van Dijk Link through ISIN (CUSIP) for US Companies Link through CRSP Link through ISIN or SEDOL

Source: WRDS Linking Matrix

Merging & Identifiers

The Identifier Zoo

Why so many?

  • Different databases, different purposes, historical reasons
  • CRSP uses PERMNO, Compustat uses GVKEY, SEC filings use CIK, everyone uses CUSIP (but which version?)
Identifier Source1 What it identifies Persistence
CUSIP Compustat Security Can change!
GVKEY Compustat Company Stable
PERMNO CRSP Security Stable
ISIN Refinitiv Security Can change!
TICKER Refinitiv Security Can change!

Merging & Identifiers

The Identifier Zoo

Why so many?

  • Different databases, different purposes, historical reasons
  • CRSP uses PERMNO, Compustat uses GVKEY, SEC filings use CIK, everyone uses CUSIP (but which version?)

Key insights

  • Always check which identifier(s) your datasets use before merging!
  • Check WRDS to find linking tables between identifiers (e.g., CRSP-Compustat link)
  • Saves a ton of work and headache later!

Merging & Identifiers

Merge Types

Fundamentals

1:1 Merge (a.k.a. inner-join)

What to watch out for?
  • Ensure both datasets have unique keys
  • Check for unexpected duplicates before merging
  • Non-matches: Decide how to handle them (drop)

Merging & Identifiers

Merge Types

Fundamentals

1:m Merge (a.k.a. left-join)

m:1 Merge (a.k.a. right-join)

Merging & Identifiers

Merge Types

What to watch out for?

  • Ensure both datasets have unique keys
  • Check for unexpected duplicates before merging
  • Understand the direction of the merge (which dataset is “master”)
  • Decide how to handle non-matches (keep/drop)
    • Keep all from “master” dataset

Merging & Identifiers

Merge Types

Fundamentals

m:m Merge

What to watch out for?
  • Avoid m:m merges whenever possible!
  • Usually indicates a problem with the data or merge keys
  • Can lead to data explosion and incorrect results
  • If unavoidable, carefully check the results and understand implications

Merging & Identifiers

Example

Merging & Identifiers

Example

Merging & Identifiers

Reality check

Best practices for merging data

  • Check uniqueness before merging
  • Check match rates after merging
  • Investigate unmatched observations—are they expected?

Examples

  • Merging firm fundamentals (Compustat) with stock prices (CRSP)
    • Use CRSP-Compustat link table (gvkey ↔︎ permno)
      • Or download the merged dataset directly from WRDS!😉
    • One gvkey (Compustat) per firm to multiple daily prices (1:m) per year
    • Left-join: Keep all firm-(years) from Compustat, add prices from CRSP

Dieselgate Teaser & Assignment preview

VW Dieselgate

What happened?

Still making headlines today

September 2015

September 3

VW engineers privately admitted to the US Environmental Protection Agency that 480,000 diesel cars had been fitted with illegal defeat devices to understate emissions

September 18

EPA publicly announced it had found software-based defeat devices on model year 2009-2015 VW and Audi diesel cars with 2-litre engines

September 23

Martin Winterkorn resigned as CEO, stating “I am not aware of any wrongdoing on my part”

Ask FT (2026)

VW Dieselgate

The timeline

VW Dieselgate

Why this matters for finance research and pratice

Research implications/opportunities

  • Event study opportunity (Lecture 5 preview)
  • Cross-firm contagion (did other automakers suffer?)
  • Data challenges: Matching news to stock prices, firms to industries, prices to accounting data

Practical implications

  • Risk management: How to price in such events?
  • Regulatory oversight: Detecting fraud early
  • Corporate governance: Preventing future scandals
  • Legal consequences: Class-action lawsuits, fines

VW Dieselgate

What does it have to do with the assignments?

Assignments 1-5 use VW Dieselgate as a running example…

  • Assignment 1: Data wrangling & visualization
  • Assignment 2: visualization
  • Assignment 3: Cross-sectional regression analysis (OLS)
  • Assignment 4: Panel data regression analysis (DiD)
  • Assignment 5: Event study analysis

VW Dieselgate

Assignment 1 walkthrough

The skinny

Data
  • WRDS Compustat + CRSP
  • auto_firms_raw.csv: Company identifiers and industry codes from Compustat
  • auto_firms_prices.csv: Daily stock prices for automakers
  • All data provided in the assignment folder on JupyterHub
Your tasks
  • Clean the header file (filter to relevant firms—German automakers including VW)
  • Merge price data to get one analysis-ready “tidy” dataset
  • Create basic variables (returns, log prices) and publication-quality tables
  • Teaser lecture 2: Time-series plot of VW’s stock price with event annotation

All details in the assignment document.

Let’s check it out!

JupyterHub

Thank You for Your Attention!

See You in the Next One!

References

Huntington-Klein, Nick. 2022. The Effect: An Introduction to Research Design and Causality. 2nd ed. Chapman; Hall/CRC.
Verbeek, Marno. 2021. Panel Methods for Finance: A Guide to Panel Data Econometrics for Financial Applications. De Gruyter.

Appendix

Appendix

Wrds Web Interface example

Accessing WRDS via web interface - Compustat